A Bootstrapping Method for Extracting Bilingual Text Pairs

نویسندگان

  • Hiroshi Masuichi
  • Raymond Flournoy
  • Stefan Kaufmann
  • Stanley Peters
چکیده

This paper proposes a method for extracting bilingual text pairs from a comparable corpus. The basic idea of the method is to apply bootstrapping to an existing corpusbased cross-language information retrieval (CLIR) approach. We conducted preliminary tests with English and Japanese bilingual corpora. The bootstrapping method led to much better results for the task of extracting translation pairs compared with a corpus-based CLIR method without bootstrapping, and the extracted translation pairs could be useful training data for improving results of the corpus-based CLIR method.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora

This paper proposes a novel noise-aware character alignment method for bootstrapping statistical machine transliteration from automatically extracted phrase pairs. The model is an extension of a Bayesian many-to-many alignment method for distinguishing nontransliteration (noise) parts in phrase pairs. It worked effectively in the experiments of bootstrapping Japanese-to-English statistical mach...

متن کامل

Multi-level Bootstrapping For Extracting Parallel Sentences From a Quasi-Comparable Corpus

We propose a completely unsupervised method for mining parallel sentences from quasi-comparable bilingual texts which have very different sizes, and which include both in-topic and off-topic documents. We discuss and analyze different bilingual corpora with various levels of comparability. We propose that while better document matching leads to better parallel sentence extraction, better senten...

متن کامل

Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and EM

We present a method capable of extracting parallel sentences from far more disparate “very-non-parallel corpora” than previous “comparable corpora” methods, by exploiting bootstrapping on top of IBM Model 4 EM. Step 1 of our method, like previous methods, uses similarity measures to find matching documents in a corpus first, and then extracts parallel sentences as well as new word translations ...

متن کامل

Mining Very-Non-Parallel Corpora: Parallel Sentence and Lexicon Extraction via Bootstrapping and E

We present a method capable of extracting parallel sentences from far more disparate “very-non-parallel corpora” than previous “comparable corpora” methods, by exploiting bootstrapping on top of IBM Model 4 EM. Step 1 of our method, like previous methods, uses similarity measures to find matching documents in a corpus first, and then extracts parallel sentences as well as new word translations ...

متن کامل

Finding small molecule and protein pairs in scientific literature using a bootstrapping method

The relationship between small molecules and proteins has attracted attention from the biomedical research community. In this paper a text mining method of extracting smallmolecule and protein pairs from natural text is presented, based on a semi-supervised machine learning approach. The technique has been applied to the complete collection of MEDLINE abstracts and pairs were extracted and eval...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000